name: toc

Table of Contents


class: inverse, center, middle

Pre-bootcamp HW

Talk about your analysis on the
weather data
in groups of 2-3


Let’s get it all out there

When you think of R what comes to mind?


name: whyr

Why are you here wanting to learn R?


Why should you learn R?

  • R is free, R improves, R has existed for 40+ years
  • Students can show what they have learned to a potential employer easily
  • The support system for R is actually much better than some people say

But as a professor, why should you learn R?

  • R and R Markdown will save you TONS of time. Long term thinking is key.
    • You can easily tweak your code if you need to do another analysis
    • Remembering what drop-down menu some thing is in two years from now in a different program will be hard
    • Remembering to copy-and-paste your updated plots/analysis into your word processor is a pain and error prone

But as a professor, why should you learn R?


class: inverse, center, middle

DEMO in RStudio
with R Markdown


class: center, middle

Analogy

R RStudio DataCamp
Drawing Drawing Drawing

What this bootcamp is

  • Designed for all levels of knowledge and background

What this bootcamp is not

  • A thorough development of machine learning
    techniques applied to big data

  • A discussion on the role of p-values in science

  • A tutorial on how to work with base R subsetting and base R graphics

What I hope you’ll learn

  • That R is not as scary as it used to be.

  • Learning how to Google is an incredibly valuable skill.

  • That having students turn in assignments using R scripts and R Markdown provides an efficient way to check their work and their analyses easily.

What I hope you’ll learn

  • That the R packages you will use in this workshop are the same ones that are used by scientists and graduate programs all over the world

  • That teaching students how to use open-source tools is what is best for them long term


Pieces of advice

  • Scaffold & support as a foreign languages do

  • To be able to use R (and really any other language), students need more than a few assignments and more than a weekly lab

  • Make use of RStudio Projects (students have a really hard time navigating directory structures)

Pieces of advice

My opinion

  • Have students work on writing their code in R script files and documenting their errors
    • This is the same workflow that DataCamp uses

  • Show students R Markdown after a few weeks of working with the script file
    • I’ve found it is hard for students to learn R and R Markdown from the start
    • Better to have them use R Markdown in groups initially

class: inverse, center, middle

R Data Types


The bare minimum needed for understanding today

Vector/variable - Type of vector (int, num, chr, lgl, date)

Data frame - Vectors of (potentially) different types - Each vector has the same number of rows


The bare minimum needed for understanding today

library(tibble)
library(lubridate)
ex1 <- data_frame(
    vec1 = c(1980, 1990, 2000, 2010),
    vec2 = c(1L, 2L, 3L, 4L),
    vec3 = c("low", "low", "high", "high"),
    vec4 = c(TRUE, FALSE, FALSE, FALSE),
    vec5 = ymd(c("2017-05-23", "1776/07/04", "1983-05/31", "1908/04-01"))
  )
ex1


class: center, middle

Welcome to the tidyverse!

The tidyverse is a collection of R packages that share common philosophies and are designed to work together.


Beginning steps

Frequently the first thing to do when given a dataset is to - identify the observational unit, - specify the variables, - give the types of variables you are presented with, and - check that the data is tidy. (TO COME)

This will help with - choosing the appropriate plot, - summarizing the data, and - understanding which inferences/models can be applied.


class: inverse, center, middle name: viz

Data Visualization

Inspired by Hans Rosling


  • What are the variables here?
  • What is the observational unit?
  • How are the variables mapped to aesthetics?

Grammar of Graphics

.left-column[ Wilkinson (2005) laid out the proposed “Grammar of Graphics”]

.right-column[ ]


Grammar of Graphics in R

.left-column[ Wickham implemented the grammar in R in the ggplot2 package]

.right-column[ ]


Grammar of Graphics elsewhere


class: center, middle

# What is a statistical graphic?

## A mapping of
data variables

to
aes()thetic attributes

## of
geom_etric objects.

class: inverse, center, middle

Back to basics


Consider the following data in tidy format:

simple_ex <-
  data_frame(
    A = c(1980, 1990, 2000, 2010),
    B = c(1, 2, 3, 4),
    C = c(3, 2, 1, 2),
    D = c("low", "low", "high", "high")
  )
simple_ex

Consider the following data in tidy format:

A B C D
1980 1 3 low
1990 2 2 low
2000 3 1 high
2010 4 2 high
  • Sketch the graphics below on paper, where the x-axis is variable A and the y-axis is variable B
  1. A scatter plot
  2. A scatter plot where the color of the points corresponds to D
  3. A scatter plot where the size of the points corresponds to C
  4. A line graph
  5. A line graph where the color of the line corresponds to D with points added that are all forestgreen of size 4.

Reproducing the plots in ggplot2

1. A scatterplot

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_point()


Reproducing the plots in ggplot2

2. A scatter plot where the color of the points corresponds to group

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_point(mapping = aes(color = D))


Reproducing the plots in ggplot2

3. A scatter plot where the size of the points corresponds to C

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B, size = C)) + 
  geom_point()


Reproducing the plots in ggplot2

4. A line graph

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_line()


Reproducing the plots in ggplot2

5. A line graph where the color of the line corresponds to D with points added that are all blue of size 4.

library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_line(mapping = aes(color = D)) +
  geom_point(color = "forestgreen", size = 4)


The Five-Named Graphs

The 5NG of data viz

  • Scatterplot: geom_point()
  • Line graph: geom_line()

  • Histogram: geom_histogram()
  • Boxplot: geom_boxplot()
  • Bar graph: geom_bar()

class: center, middle

More ggplot2 examples


Histogram

library(nycflights13)
ggplot(data = weather, mapping = aes(x = humid)) +
  geom_histogram(bins = 20, color = "black", fill = "darkorange")


Boxplot (broken)

library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, y = humid)) +
  geom_boxplot()


Boxplot (almost fixed)

library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, group = month, y = humid)) +
  geom_boxplot() 

Boxplot (fixed)

library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, group = month, y = humid)) +
  geom_boxplot() +
  scale_x_continuous(breaks = 1:12)


Bar graph

library(fivethirtyeight)
ggplot(data = bechdel, mapping = aes(x = clean_test)) +
  geom_bar()


How about over time?

  • Hop into dplyr
library(dplyr)
year_bins <- c("'70-'74", "'75-'79", "'80-'84", "'85-'89",
               "'90-'94", "'95-'99", "'00-'04", "'05-'09",
               "'10-'13")
bechdel <- bechdel %>%
  mutate(five_year = cut(year, 
                         breaks = seq(1969, 2014, 5), 
                         labels = year_bins)) %>% 
  mutate(clean_test = factor(clean_test, 
                             levels = c("nowomen", "notalk", "men",
                                        "dubious", "ok")))

How about over time? (Stacked)

library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
       mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar()


How about over time? (Side-by-side)

library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
       mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar(position = "dodge")


How about over time? (Stacked proportional)

library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
       mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar(position = "fill", color = "black")


class: center, middle

The tidyverse/ggplot2 is for beginners and
for data science professionals!


Practice

Produce appropriate 5NG with R package & data set in [ ], e.g., [nycflights13 \(\rightarrow\) weather]

  1. Does age predict recline_rude?
    [fivethirtyeight \(\rightarrow\) na.omit(flying)]

  2. Distribution of age by sex
    [okcupiddata \(\rightarrow\) profiles]

  3. Does budget predict rating?
    [ggplot2movies \(\rightarrow\) movies]

  4. Distribution of log base 10 scale of budget_2013
    [fivethirtyeight \(\rightarrow\) bechdel]


HINTS


class: inverse, center, middle

DEMO of ggplot2 in RStudio


class: center, middle

Day 2

Data Wrangling


gapminder data frame in the gapminder package

library(gapminder)
gapminder

Base R versus the tidyverse

Say we wanted mean life expectancy across all years for Asia

# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
[1] 60.0649

library(dplyr)
gapminder %>% filter(continent == "Asia") %>%
  summarize(mean_exp = mean(lifeExp))

The pipe %>%

   

  • A way to chain together commands

  • It is essentially the dplyr equivalent to the
    + in ggplot2

## The 5NG of data viz

geom_point()
geom_line()
geom_histogram()
geom_boxplot()
geom_bar()


The Five Main Verbs (5MV) of data wrangling

filter()
summarize()
group_by()
mutate()
arrange()


filter()

  • Select a subset of the rows of a data frame.

  • The arguments are the “filters” that you’d like to apply.

library(gapminder); library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
head(gap_2007)
  • Use == to compare a variable to a value

Logical operators

  • Use | to check for any in multiple filters being true:

gapminder %>% 
  filter(year == 2002 | continent == "Europe")


Logical operators

  • Use & or , to check for all of multiple filters being true:

gapminder %>% 
  filter(year == 2002, continent == "Europe")

Logical operators

  • Use %in% to check for any being true
    (shortcut to using | repeatedly with ==)

gapminder %>% 
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))


summarize()

  • Any numerical summary that you want to apply to a column of a data frame is specified within summarize().
max_exp_1997 <- gapminder %>% 
  filter(year == 1997) %>% 
  summarize(max_exp = max(lifeExp))
max_exp_1997


Combining summarize() with group_by()

When you’d like to determine a numerical summary for all levels of a different categorical variable

max_exp_1997_by_cont <- gapminder %>% 
  filter(year == 1997) %>% 
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp))
max_exp_1997_by_cont

Without the %>%

It’s hard to appreciate the %>% without seeing what the code would look like without it:

max_exp_1997_by_cont <- 
  summarize(
    group_by(
      filter(
        gapminder, 
          year == 1997), 
      continent),
    max_exp = max(lifeExp))
max_exp_1997_by_cont

ggplot2 revisited

For aggregated data, use geom_col

ggplot(data = max_exp_1997_by_cont, 
       mapping = aes(x = continent, y = max_exp)) +
  geom_col()

The 5MV

  • filter()
  • summarize()
  • group_by()

  • mutate()
- arrange()

mutate()

  • Allows you to
    1. create a new variable with a specific value OR
    2. create a new variable based on other variables OR
    3. change the contents of an existing variable

gap_plus <- gapminder %>% mutate(just_one = 1)
head(gap_plus)

mutate()

  • Allows you to
    1. create a new variable with a specific value OR
    2. create a new variable based on other variables OR
    3. change the contents of an existing variable

gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
head(gap_w_gdp)

mutate()

  • Allows you to
    1. create a new variable with a specific value OR
    2. create a new variable based on other variables OR
    3. change the contents of an existing variable

gap_weird <- gapminder %>% mutate(pop = pop + 1000)
head(gap_weird)

arrange()

  • Reorders the rows in a data frame based on the values of one or more variables

gapminder %>%
  arrange(year, country)

arrange()

  • Can also put into descending order

gapminder %>%
  filter(year > 2000) %>%
  arrange(desc(lifeExp))

Don’t mix up arrange and group_by

  • group_by is used (mostly) with summarize to calculate summaries over groups

  • arrange is used for sorting


Don’t mix up arrange and group_by

This doesn’t really do anything useful

gapminder %>% group_by(year)

Don’t mix up arrange and group_by

But this does

gapminder %>% arrange(year)

Changing of observation unit

True or False > Each of filter, mutate, and arrange change the observational unit.

True or False > group_by() %>% summarize() changes the observational unit.


What is meant by “joining data frames” and
why is it useful?


Does cost of living in a state relate to whether police officers live in the cities they patrol? What about state political ideology?

library(fivethirtyeight); library(readr); library(dplyr)
police2 <- police_locals %>% select(city, all)
ideology <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/ideology.csv")
police_join <- inner_join(x = police2, y = ideology, by = "city")
police_join


url <- paste0("https://ismayc.github.io/",
              "Effective-Data-Storytelling-using-the-tidyverse/",
              "datasets/cost_of_living.csv")
cost_of_living <- read_csv(url)
police_join_cost <- inner_join(
    x = police_join, 
    y = cost_of_living, 
    by = "state"
  )
police_join_cost %>% select(-state) %>% arrange(city)

Does cost of living in a state relate to whether police officers live in the cities they patrol? What about state political ideology?

ggplot(data = police_join_cost,
       mapping = aes(x = index, y = all)) +
  geom_point(mapping = aes(color = state_ideology)) +
  labs(x = "Cost of Living Index", y = "% Officers Living in City")


DEMO of dplyr in RStudio


class: inverse, center, middle

Data Importing and Tidying


tidyverse packages

IMPORT

  • haven for SPSS, SAS, and Stata data files
  • jsonlite for JSON files
  • readxl for XLS and XLSX files
  • readr for CSV and TSV files (and R specific RDS files)


TIDYING

  • tidyr for converting “messy” into “tidy” data frames

name: import

Basics of Importing

We will begin by downloading and importing a variety of different “messy” data sets. You can download all of them in a ZIP file at http://bit.ly/reed-messy-data. The links below go to the original sources. I’ve converted these original sources into different formats.


class: center, middle, inverse

Demonstration in RStudio


Practice


name: tidy class: inverse, middle, center

Tidy Data


  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

The third point means we don’t mix apples and oranges.


What is Tidy Data?

  1. Each observation forms a row. In other words, each row corresponds to a single instance of an observational unit
  2. Each variable forms a column:
    • Some variables may be used to identify the observational units.
    • For organizational purposes, it’s generally better to put these in the left-hand columns
  3. Each type of observational unit forms a table with the entries in the table corresponding to values.

Differentiating between neat data and tidy data

  • Colloquially, they mean the same thing
  • But in our context, one is a subset of the other.


Neat data is - easy to look at, - organized nicely, and - in table form.

Tidy data is neat but also abides by a set of three rules.


class: center, middle

Drawing


Is this tidy?

year title clean_test budget_2013
1995 Apollo 13 ok 99370665
2005 Brokeback Mountain notalk 16583160
2010 Diary of a Wimpy Kid ok 16023478
1984 Dune dubious 100864980
1984 Ghostbusters notalk 67243320
2003 How to Lose a Guy in 10 Days men 63304348
2011 Iris ok 5696299
2004 Sideways ok 20964279
2000 Songcatcher ok 2435235
2004 Team America: World Police men 24663858
2010 Tron Legacy notalk 213646368
2011 War Horse notalk 72498355

name: demscore

How about this? Is this tidy?

country 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Albania -9 -9 -9 -9 -9 -9 -9 -9 5 5 7 9
Argentina -9 -1 -1 -9 -9 -9 -8 8 7 7 8 8
Armenia -9 -7 -7 -7 -7 -7 -7 -7 7 -6 5 5
Australia 10 10 10 10 10 10 10 10 10 10 10 10
Austria 10 10 10 10 10 10 10 10 10 10 10 10
Azerbaijan -9 -7 -7 -7 -7 -7 -7 -7 1 -6 -7 -7
Belarus -9 -7 -7 -7 -7 -7 -7 -7 7 -7 -7 -7
Belgium 10 10 10 10 10 10 10 10 10 10 10 8
Bhutan -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -6
Bolivia -4 -3 -3 -4 -7 -7 8 9 9 9 9 8
Brazil 5 5 5 -9 -9 -4 -3 7 8 8 8 8
Bulgaria -7 -7 -7 -7 -7 -7 -7 -7 8 8 9 9

name: whytidy

Why is tidy data important?

  • Think about trying to plot democracy score across years in the simplest way possible.

  • It would be much easier if the data looked like what follows instead so we could put
    • year on the x-axis and
    • dem_score on the y-axis.

Tidy is good

country year dem_score
Argentina 1962 -1
Armenia 1997 -6
Denmark 1962 10
Ethiopia 2007 1
Finland 2007 10
Ireland 1992 10
Libya 1957 -7
Libya 1982 -7
Mexico 1977 -3
Mexico 1982 -3
Spain 1962 -7
Switzerland 1982 10
Ukraine 1997 7

Let’s plot it

  • Plot the line graph for three countries using ggplot
dem_score4 <- dem_score_tidy %>%
  filter(country %in% c("Pakistan", "Portugal", "Uruguay"))
ggplot(data = dem_score4, mapping = aes(x = year, y = dem_score)) +
  geom_line(mapping = aes(color = country))


Tidying

The Life Expectancy by year data

library(readr)
life_exp_df <- read_csv("data/le_mess.csv")
#View(life_exp_df)


Doing the “tidying”/reshaping

library(tidyr)
library(dplyr)
(life_exp_tidy <- life_exp_df %>% 
    gather(key = "year", value = "life_exp", -country))

Tidying

World Health Organization TB data (Stata DTA)

library(haven)
who_df <- read_dta("data/who.dta")
#View(who_df)


WHO ugly…

  • This data set contains strange variable names that will require us to look up their meaning in the corresponding data dictionary.

  • What do we know?
    • country, iso2, and iso3 are redundant and identify the country
    • year is a variable and it looks like it corresponds to each country
  • But what in the world is new_sp_m014? And the other columns?…

First step

Like before, we can gather these non-country and non-year variables together:

who_tidy <- who_df %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "value")

We can now see what this has done to the data frame:

View(who_tidy)

Data dictionary saves us some…

  1. The first three letters of entries in the key column correspond to new or old cases of TB.
  2. The next two letters (after the _) correspond to TB type:
    • rel for relapse, ep for extrapulmonary TB
    • sn for smear negative, sp for smear positive
  3. The next letter after the second _ corresponds to the sex of the TB patient.
  4. The remaining numbers correspond to age group:
    • 014 for 0 to 14 years
    • 65 for 65 or older

  • Looking over the entries of key in who_tidy, do you see anything else that doesn’t match the pattern?

  • It may be easier to remember that the entries of key correspond to column names in who_df:

names(who_df)
 [1] "country"      "iso2"         "iso3"         "year"        
 [5] "new_sp_m014"  "new_sp_m1524" "new_sp_m2534" "new_sp_m3544"
 [9] "new_sp_m4554" "new_sp_m5564" "new_sp_m65"   "new_sp_f014" 
[13] "new_sp_f1524" "new_sp_f2534" "new_sp_f3544" "new_sp_f4554"
[17] "new_sp_f5564" "new_sp_f65"   "new_sn_m014"  "new_sn_m1524"
[21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564"
[25] "new_sn_m65"   "new_sn_f014"  "new_sn_f1524" "new_sn_f2534"
[29] "new_sn_f3544" "new_sn_f4554" "new_sn_f5564" "new_sn_f65"  
[33] "new_ep_m014"  "new_ep_m1524" "new_ep_m2534" "new_ep_m3544"
[37] "new_ep_m4554" "new_ep_m5564" "new_ep_m65"   "new_ep_f014" 
[41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554"
[45] "new_ep_f5564" "new_ep_f65"   "newrel_m014"  "newrel_m1524"
[49] "newrel_m2534" "newrel_m3544" "newrel_m4554" "newrel_m5564"
[53] "newrel_m65"   "newrel_f014"  "newrel_f1524" "newrel_f2534"
[57] "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"  

A fix using stringr

You can replace all of the entries in key that contained newrel with new_rel via

library(stringr)
library(dplyr)
who_tidy <- who_tidy %>% 
  mutate(key = str_replace(
      string = key, 
      pattern = "newrel", 
      replacement = "new_rel")
    )

Pulling apart variables

The entry new_sp_m014 is actually four different variables. Use the separate function to pull them apart:

who_tidy <- who_tidy %>% 
  separate(col = key, into = c("new", "type", "sexage"), sep = "_")
who_tidy

One more pull apart

  • We need to pull sexage into two different variables.
  • If you use ?separate, you’ll see that the following is an option:
(who_tidy <- who_tidy %>% 
  separate(col = sexage, into = c("sex", "age"), sep = 1))

Final cleaning

  • iso2 and iso3 are redundant if we have country
  • new is constant since this only has new cases of TB
  • Let’s just ignore missing values here
  • We also know that value is actually cases so we can rename that column.
who_tidy <- who_tidy %>% select(-iso2, -iso3, -new) %>% 
  na.omit() %>% rename("cases" = value)
head(who_tidy, 5)

The power of the %>% (pipe)

Everything we did above can be chained into one sequence:

who_tidy <- who_df %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "value") %>%  
  mutate(key = str_replace(key, pattern = "newrel", 
                           replacement = "new_rel")) %>% 
  separate(col = key, into = c("new", "type", "sexage"), sep = "_") %>% 
  separate(col = sexage, into = c("sex", "age"), sep = 1) %>% 
  select(-iso2, -iso3, -new) %>% 
  na.omit() %>% 
  rename("cases" = value)

BONUS

A Gapminder tidy dataset read in from a Google Sheet!

  • I’ve updated the gapminder data set available in the gapminder R data package by Jenny Bryan here. Jenny provides instructions for recreating the data.

BONUS

  • You can download the updated data from Google Sheets by running the following in R:
# Link is https://docs.google.com/spreadsheets/d/18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA/edit?usp=sharing
library(googlesheets)
updated_gap <- gs_key("18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA") %>%
  gs_read()
  • A script going through the steps to take the “messy” original Gapminder.org files and turn them into this tidy dataset is available here.

Practice (after the bootcamp)

Try to tidy the other data sets you downloaded and imported or other data sets you have!


name: model class: inverse, center, middle

Data Modeling


Modeling

  1. Experience with ggplot2 package and knowledge of the Grammar of Graphics primes students for modeling
  2. Use of the broom package to unpack regression output into a tidy format

1. ggplot2 Primes Regression

Example:

  • All Alaskan Airlines and Frontier flights leaving NYC in 2013
  • We want to study the relationship between temperature and departure delay
  • For summer (June, July, August) and non-summer months separately

Involves four variables:

  • carrier, temp, dep_delay, summer

1. ggplot2 Primes Regression

flights_subset <- flights %>% 
  filter(carrier == "AS" | carrier == "F9") %>% 
  left_join(weather, by=c("year", "month", "day", "hour", "origin")) %>% 
  filter(dep_delay < 250) %>% 
  mutate(summer = ifelse(month == 6 | month == 7 | month == 8, "Summer Flights", "Non-Summer Flights"))

ggplot(data = flights_subset, 
    mapping = aes(x = temp, y = dep_delay, color = carrier)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ summer)


1. ggplot2 Primes Regression

Why? Dig deeper into data. Look at origin and dest variables as well:


flights_subset %>% 
  group_by(carrier, origin, dest) %>% 
  summarise(`Number of Flights` = n())

2. broom Package

  • The broom package takes the messy output of built-in modeling functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.
  • Fits in with tidyverse ecosystem
  • This works for many R data types!

2. broom Package

In our case, broom functions take lm objects as inputs and return the following in tidy format!

  • tidy(): regression output table
  • augment(): point-by-point values (fitted values, residuals, predicted values)
  • glance(): scalar summaries like \(R^2\) ,

2. broom Package

Example

library(tidyverse)
library(nycflights13)
library(knitr)
library(broom)
set.seed(2017)

# Load Alaska data, deleting rows that have missing departure delay
# or arrival delay data
alaska_flights <- flights %>% 
  filter(carrier == "AS") %>% 
  filter(!is.na(dep_delay) & !is.na(arr_delay)) %>% 
  sample_n(50)

# Exploratory Data Analysis-----------------------------------------
# Plot of sample of points:
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
   geom_point()


# Correlation coefficient:
alaska_flights %>% summarize(correl = cor(dep_delay, arr_delay))
# Add regression line
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")


# Fit Regression and Study Output with broom Package-------------------
# Fit regression
options(scipen = 6) # Set scientific notation
delay_fit <- lm(formula = arr_delay ~ dep_delay, data = alaska_flights)

# 1. broom::tidy() regression table with confidence intervals 
# and no p-value stars
regression_table <- delay_fit %>% 
  tidy(conf.int = TRUE)
regression_table

# 2. broom::augment() for point-by-point values
regression_points <- delay_fit %>% 
  augment() %>% 
  select(arr_delay, dep_delay, .fitted, .resid) 
regression_points

# and for prediction
new_flights <- data_frame(dep_delay = c(25, 30, 15))
delay_fit %>% 
  augment(newdata = new_flights)

# 3. broom::glance() scalar summaries of regression
regression_summaries <- delay_fit %>% 
  glance() 
regression_summaries

# Residual Analysis------------------------------------------------------
ggplot(data = regression_points, mapping = aes(x = .resid)) +
  geom_histogram(binwidth=10, color = "white") +
  geom_vline(xintercept = 0, color = "blue")


ggplot(data = regression_points, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 0, color = "blue")


ggplot(data = regression_points, mapping = aes(sample = .resid)) +
  stat_qq()


class: center, middle, inverse name: rmarkdown

Data Communicating


What is Markdown?

  • A “plaintext formatting syntax”
  • Type in plain text, render to more complex formats
  • One step beyond writing a txt file
  • Render to HTML, PDF, DOCX, etc. using Pandoc

What does it look like?

.left-column[

  # Header 1
  
  ## Header 2
  
  Normal paragraphs of text go here.
  
  **I'm bold**
  
  [links!](http://rstudio.com)
  
   * Unordered
   * Lists   
   
  And  Tables
  ---- -------
  Like This
  

]

.right-column[ markdown]


What is R Markdown?

  • “Literate programming”
  • Embed R code in a Markdown document
  • Renders textual output along with graphics

.left-column[


```{r chunk1}
library(ggplot2)
library(nycflights13)
pdx_flights <- flights %>% 
  filter(dest == "PDX", month == 5)
nrow(pdx_flights)
```

```{r chunk2}
ggplot(data = pdx_flights,
  mapping = aes(x = arr_delay, 
                y = dep_delay)) +
  geom_point()
```

]

.right-column[

[1] 88

]


Practice

Turn a statistical analysis you have conducted into an R Markdown document.


class: middle

Thanks for attending!

---
title: Putting the [`R`]() in [`R`]()eed and <br> in Lewis and Cla[`R`]()k
author: "Chester Ismay <br><br> GitHub: [`ismayc`](https://github.com/ismayc) <br> Twitter:  [`@old_man_chester`](https://twitter.com/old_man_chester)"
date: 2017-05-25 & 2017-05-26 <br><br> Bootcamp website at <http://bit.ly/rbootcamp17> <br> Slides available at <http://bit.ly/rbootcamp17-slides>
output: 
    html_document:
      toc: true
      toc_float: true
      toc_depth: 1
      theme: cosmo
      highlight: tango
      df_print: paged
      code_download: true
---

name: toc

# Table of Contents

- [Data Viz](#viz)
- [Data Wrangling](#wrangle)
- [Data Importing](#import)
- [Data Tidying](#tidy)
- [Data Modeling](#model)
- [Data Communicating](#rmarkdown)


---

class: inverse, center, middle

<!-- Write slide link on whiteboard -->

```{r include=FALSE}
#options(servr.daemon = TRUE)
library(tidyverse)
library(mosaic)
library(stringr)
library(okcupiddata)
library(knitr)
filter <- dplyr::filter
knitr::opts_chunk$set(warning=FALSE, message=FALSE, fig.width=9.5, fig.height=4.5, 
  comment=NA, rows.print=16)
theme_set(theme_gray(base_size = 24))
```

# Pre-bootcamp HW

## Talk about your analysis on the <br> `weather` data <br> in groups of 2-3

---

## Let's get it all out there

## When you think of R what comes to mind?

---

name: whyr

## Why are you here wanting to learn R?

- ~~Because you enjoy partaking in [Schadenfreude](https://en.wikipedia.org/wiki/Schadenfreude)~~

<center>
<img src="figure/google_scholar.png" style="width: 700px;"/>
</center>

---

## Why should you learn R?

- R is free, R improves, R has existed for 40+ years
- Students can show what they have learned to a potential employer easily
- The support system for R is actually much better than some people say


---

## But as a professor, why should you learn R?

- R and R Markdown will save you TONS of time. Long term thinking is key.
    - You can easily tweak your code if you need to do another analysis
    - Remembering what drop-down menu some thing is in two years from now in a different program will be hard
    - Remembering to copy-and-paste your updated plots/analysis into your word processor is a pain and error prone

---

## But as a professor, why should you learn R?
    
<center>
<img src="figure/tweet.jpg" style="width: 700px;"/>
</center>

---

class: inverse, center, middle

# DEMO in RStudio <br> with R Markdown

---

class: center, middle

## Analogy

R            |  RStudio |  DataCamp
:-------------------------:|:-------------------------:|:-------------------------:
<img src="figure/engine.jpg" alt="Drawing" style="width: 350px;"/>  |  <img src="figure/dashboard.jpg" alt="Drawing" style="width: 350px;"/>  |  <img src="figure/instructor.jpg" alt="Drawing" style="width: 350px;"/>

---

## What this bootcamp is

![](https://ourcodingclub.github.io/img/tidyverse.png)

- Designed for all levels of knowledge and background

---

## What this bootcamp is not

- A thorough development of machine learning <br> techniques applied to big data

--

- A discussion on the role of _p_-values in science

--

- A tutorial on how to work with base R subsetting and base R graphics 

---

## What I hope you'll learn

- That R is not as scary as it used to be.  

--

- Learning how to Google is an incredibly valuable skill.

--

- That having students turn in assignments using R scripts and R Markdown provides an efficient way to check their work and their analyses easily.

---

## What I hope you'll learn

- That the R packages you will use in this workshop are the same ones that are used by scientists and graduate programs all over the world

--

- That teaching students how to use open-source tools is what is best for them long term

--

- That students with no programming background can do great things in only a few weeks
    - [Sociology/Criminal Justice majors at Pacific U.](https://ismayc.github.io/soc301_s2017/group-projects/index.html)
    - [A wide range of students at Middlebury College](https://rudeboybert.github.io/MATH116/PS/final_project/final_project_outline.html)

---

## Pieces of advice

<center>
<a href="http://giphy.com/gifs/pool-diving-synchronized-swimming-pDWtwK7D2IlFu"><img src="figure/giphy2.gif" style="width: 300px;"/></a>
</center>

- Scaffold & support as a foreign languages do

--

- To be able to use R (and really any other language), students need more than a few assignments and more than a weekly lab

--

- Make use of RStudio Projects (students have a really hard time navigating directory structures)


---

## Pieces of advice

### My opinion

- Have students work on writing their code in R script files and documenting their errors
    - This is the same workflow that DataCamp uses

--

- Show students [R Markdown after a few weeks](https://ismayc.github.io/soc301_s2017/schedule/) of working with the script file
    - I've found it is hard for students to learn R and R Markdown from the start
    - Better to have them use R Markdown in groups initially


---


class: inverse, center, middle

# R Data Types

---

## The bare minimum needed for understanding today

Vector/variable
  - Type of vector (`int`, `num`, `chr`, `lgl`, `date`)

--

Data frame
  - Vectors of (potentially) different types
  - Each vector has the same number of rows

---

## The bare minimum needed for understanding today

```{r eval=FALSE}
library(tibble)
library(lubridate)
ex1 <- data_frame(
    vec1 = c(1980, 1990, 2000, 2010),
    vec2 = c(1L, 2L, 3L, 4L),
    vec3 = c("low", "low", "high", "high"),
    vec4 = c(TRUE, FALSE, FALSE, FALSE),
    vec5 = ymd(c("2017-05-23", "1776/07/04", "1983-05/31", "1908/04-01"))
  )
ex1
```

--

```{r echo=FALSE}
library(tibble)
library(lubridate)
ex1 <- data_frame(
    vec1 = c(1980, 1990, 2000, 2010),
    vec2 = c(1L, 2L, 3L, 4L),
    vec3 = c("low", "low", "high", "high"),
    vec4 = c(TRUE, FALSE, FALSE, FALSE),
    vec5 = ymd(c("2017-05-23", "1776/7/04", "1983-5/31", "1908/04-1"))
  )
ex1
```
  
---

class: center, middle  
  
## Welcome to the [tidyverse](https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/)!
  
The `tidyverse` is a collection of R packages that share common philosophies and are designed to work together. <br><br> 
  
<a href="http://tidyverse.tidyverse.org/logo.png"><img src="figure/tidyverse.png" style="width: 200px;"/><a>

---


## Beginning steps

Frequently the first thing to do when given a dataset is to
- identify the observational unit,
- specify the variables,
- give the types of variables you are presented with, and
- check that the data is <u>tidy</u>. (TO COME)

This will help with 
- choosing the appropriate plot, 
- summarizing the data, and 
- understanding which inferences/models can be applied.

---

class: inverse, center, middle
name: viz


# Data Visualization

<a href="http://gitsense.github.io/images/wealth.gif"><img src="figure/wealth.gif" style="width: 770px;"/></a>

Inspired by [Hans Rosling](https://www.youtube.com/watch?v=jbkSRLYSojo)

---

```{r echo=FALSE,fig.height=6, fig.width=10, fig.align='center'}
library(gapminder)
options(scipen = 99)
#gap_with_colors <-
#  data.frame(gapminder,
#             cc = I(country_colors[match(gapminder$country,
#                                         names(country_colors))]))

gapminder %>% filter(year == 1992) %>%
  ggplot(aes(x = log(gdpPercap, base = 10), y = lifeExp, color = continent,
             size = pop)) +
  geom_point() + xlab('Gross Domestic Product (log scale)') + ylab('Life Expectancy at birth (years)') + ggtitle("Gapminder for 1992")

#+
#  scale_color_manual(values = gapminder::continent_colors)
```


- What are the variables here?
- What is the observational unit?
- How are the variables mapped to aesthetics?

---


## Grammar of Graphics

.left-column[
Wilkinson (2005) laid out the proposed "Grammar of Graphics"
]

.right-column[
<a href="http://www.powells.com/book/the-grammar-of-graphics-9780387245447"><img src="figure/graphics.jpg" style="width: 300px;"></a>
]

---

## Grammar of Graphics in R

.left-column[
Wickham implemented the grammar in R in the `ggplot2` package
]

.right-column[
<a href="http://www.powells.com/book/ggplot2-elegant-graphics-for-data-analysis-9783319242750/68-428"><img src="figure/ggplot2.jpg" style="width: 300px;"></a>
]

---

## Grammar of Graphics elsewhere

- [Make plots online with plotly](https://plot.ly/)

- [Leland Wilkinson](https://research.tableau.com/user/leland-wilkinson) works at [Tableau](https://www.tableau.com/)

- The Grammar of Graphics provides a theoretical framework for building and deciphering statistical graphics

---

class: center, middle

# What is a statistical graphic?
--

## A `mapping` of <br> `data` variables
--

## to <br> `aes()`thetic attributes

--
## of <br> `geom_`etric objects.

---

class: inverse, center, middle

## Back to basics

---

### Consider the following data in tidy format:

```{r}
simple_ex <-
  data_frame(
    A = c(1980, 1990, 2000, 2010),
    B = c(1, 2, 3, 4),
    C = c(3, 2, 1, 2),
    D = c("low", "low", "high", "high")
  )
simple_ex
```

<!-- Copy to chalkboard/whiteboard -->

---

### Consider the following data in tidy format:

```{r echo=FALSE}
simple_ex %>% knitr::kable(format="html")
```


- Sketch the graphics below on paper, where the `x`-axis is variable `A` and the `y`-axis is variable `B`

1. <small>A scatter plot</small>
1. <small>A scatter plot where the `color` of the points corresponds to `D`</small>
1. <small>A scatter plot where the `size` of the points corresponds to `C`</small>
1. <small>A line graph</small>
1. <small>A line graph where the `color` of the line corresponds to `D` with points added that are all forestgreen of size 4.</small>

---

## Reproducing the plots in <small>`ggplot2`</small>

### 1. A scatterplot

```{r, eval=FALSE}
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_point()
```
--

```{r, echo=FALSE, fig.height=4.5}
ggplot(data = simple_ex, aes(x = A, y = B)) + 
  geom_point()
```


---


## Reproducing the plots in <small>`ggplot2`</small>

### 2. A scatter plot where the `color` of the points corresponds to `group`

```{r, eval=FALSE}
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_point(mapping = aes(color = D))
```
--

```{r, echo=FALSE, fig.height=4.5}
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_point(mapping = aes(color = D))
```


---

## Reproducing the plots in <small>`ggplot2`</small>

### 3. A scatter plot where the `size` of the points corresponds to `C`

```{r, eval=FALSE}
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B, size = C)) + 
  geom_point()
```
--

```{r, echo=FALSE, fig.height=4.5}
ggplot(data = simple_ex, mapping = aes(x = A, y = B, size = C)) + 
  geom_point()
```


---

## Reproducing the plots in <small>`ggplot2`</small>

### 4. A line graph

```{r, eval=FALSE}
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_line()
```
--

```{r, echo=FALSE, fig.height=4.5}
ggplot(data = simple_ex, aes(x = A, y = B)) + 
  geom_line()
```


---

## Reproducing the plots in <small>`ggplot2`</small>

### 5. A line graph where the `color` of the line corresponds to `D` with points added that are all blue of size 4.

```{r, eval=FALSE}
library(ggplot2)
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_line(mapping = aes(color = D)) +
  geom_point(color = "forestgreen", size = 4)
```
--

```{r, echo=FALSE, fig.height=4}
ggplot(data = simple_ex, mapping = aes(x = A, y = B)) + 
  geom_line(mapping = aes(color = D)) +
  geom_point(color = "forestgreen", size = 4)
```

---

# The Five-Named Graphs 

## The 5NG of data viz

- Scatterplot: `geom_point()`
- Line graph: `geom_line()`
--

- Histogram: `geom_histogram()`
- Boxplot: `geom_boxplot()`
- Bar graph: `geom_bar()`


---

class: center, middle

## More `ggplot2` examples

---

## Histogram

```{r fig.height=5.5}
library(nycflights13)
ggplot(data = weather, mapping = aes(x = humid)) +
  geom_histogram(bins = 20, color = "black", fill = "darkorange")
```

---

## Boxplot (broken)

```{r fig.height=5.5}
library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, y = humid)) +
  geom_boxplot()
```

---


## Boxplot (almost fixed)

```{r fig.height=5.5}
library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, group = month, y = humid)) +
  geom_boxplot() 
```
---

## Boxplot (fixed)

```{r fig.height=5.5}
library(nycflights13)
ggplot(data = weather, mapping = aes(x = month, group = month, y = humid)) +
  geom_boxplot() +
  scale_x_continuous(breaks = 1:12)
```

---

## Bar graph

```{r}
library(fivethirtyeight)
ggplot(data = bechdel, mapping = aes(x = clean_test)) +
  geom_bar()
```

---

## How about over time?

- Hop into `dplyr`

```{r}
library(dplyr)
year_bins <- c("'70-'74", "'75-'79", "'80-'84", "'85-'89",
               "'90-'94", "'95-'99", "'00-'04", "'05-'09",
               "'10-'13")
bechdel <- bechdel %>%
  mutate(five_year = cut(year, 
                         breaks = seq(1969, 2014, 5), 
                         labels = year_bins)) %>% 
  mutate(clean_test = factor(clean_test, 
                             levels = c("nowomen", "notalk", "men",
                                        "dubious", "ok")))
```

---

## How about over time? (Stacked)

```{r fig.width=11}
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
       mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar()
```

---

## How about over time? (Side-by-side)

```{r fig.width=11}
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
       mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar(position = "dodge")
```

---

## How about over time? (Stacked proportional)

```{r fig.width=11}
library(fivethirtyeight)
library(ggplot2)
ggplot(data = bechdel,
       mapping = aes(x = five_year, fill = clean_test)) +
  geom_bar(position = "fill", color = "black")
```

---

class: center, middle

## The `tidyverse`/`ggplot2` is for beginners and <br> for data science professionals!

<a href="https://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/"><img src="figure/bechdel.png" style="width: 500px;"/><a>

---

## Practice

Produce appropriate 5NG with R package & data set in [ ], e.g., [`nycflights13` $\rightarrow$ `weather`] 

<!--
Try to look through the help documentation/Google to improve your plots
-->

1. Does `age` predict `recline_rude`? <br> [`fivethirtyeight` $\rightarrow$ `na.omit(flying)`]

2. Distribution of `age` by `sex` <br> [`okcupiddata` $\rightarrow$ `profiles`]

3. Does `budget` predict `rating`? <br> [`ggplot2movies` $\rightarrow$ `movies`]

4. Distribution of log base 10 scale of `budget_2013` <br> [`fivethirtyeight` $\rightarrow$ `bechdel`]

---

### HINTS

```{r echo=FALSE, fig.height=7, fig.width=10.5}
library(gridExtra)
library(fivethirtyeight)
library(ggplot2movies)
library(okcupiddata)

p1 <- ggplot(data = na.omit(flying), mapping = aes(fill = recline_rude, x = age)) + geom_bar(position = "fill") + ggtitle("Problem 1") + theme_gray(base_size = 20)

p2 <- ggplot(data = profiles, mapping = aes(x = sex, y = age)) +
  geom_boxplot() + ggtitle("Problem 2") + theme_gray(base_size = 20)

p3 <- ggplot(data = movies, mapping = aes(x = budget, y = rating)) +
  geom_point() + ggtitle("Problem 3") + theme_gray(base_size = 20)

p4 <- ggplot(data = bechdel, mapping = aes(x = log(budget_2013, 10))) +
  geom_histogram(color = "white", bins = 10) + ggtitle("Problem 4") +
  theme_gray(base_size = 20)

grid.arrange(p1, p2, p3, p4, ncol = 2, padding = unit(0.5, "line"),
             widths = c(2.6, 1.4))

```

---

class: inverse, center, middle

# DEMO of `ggplot2` in RStudio

---

class: center, middle

### Determining the appropriate plot

<a href="https://coggle.it/diagram/V_G2gzukTDoQ-aZt"><img src="figure/viz_mindmap.png" style="width: 400px;"/><a>

---

class: inverse, center, middle
name: wrangle

## Day 2

## Data Wrangling

---

### `gapminder` data frame in the `gapminder` package

```{r rows.print=15}
library(gapminder)
gapminder
```

---

## Base R versus the `tidyverse`

Say we wanted mean life expectancy across all years for Asia

--

```{r}
# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
```
--
 
```{r}
library(dplyr)
gapminder %>% filter(continent == "Asia") %>%
  summarize(mean_exp = mean(lifeExp))
```

---

## The pipe `%>%`

`r knitr::include_graphics("figure/pipe.png")` &emsp; &emsp;`r knitr::include_graphics("figure/MagrittePipe.jpg")`
--

- A way to chain together commands
--

- It is *essentially* the `dplyr` equivalent to the <br> `+` in `ggplot2`

---

## The 5NG of data viz
--

### `geom_point()`<br> `geom_line()` <br> `geom_histogram()`<br>  `geom_boxplot()`<br> `geom_bar()`

---

# The Five Main Verbs (5MV) of data wrangling

### `filter()` <br> `summarize()` <br> `group_by()` <br> `mutate()` <br> `arrange()`

---

## `filter()`

- Select a subset of the rows of a data frame. 

- The arguments are the "filters" that you'd like to apply.
--

```{r}
library(gapminder); library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
head(gap_2007)
```

- Use `==` to compare a variable to a value

---

## Logical operators

- Use `|` to check for any in multiple filters being true:
--

```{r eval=FALSE}
gapminder %>% 
  filter(year == 2002 | continent == "Europe")
```
--

```{r echo=FALSE}
gapminder %>% 
  filter(year == 2002 | continent == "Europe")
```

---

## Logical operators

- Use `&` or `,` to check for all of multiple filters being true:
--

```{r eval=FALSE}
gapminder %>% 
  filter(year == 2002, continent == "Europe")
```

```{r echo=FALSE}
gapminder %>% 
  filter(year == 2002, continent == "Europe")
```

---

## Logical operators

- Use `%in%` to check for any being true <br> (shortcut to using `|` repeatedly with `==`)
--

```{r eval=FALSE}
gapminder %>% 
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))
```
--

```{r echo=FALSE}
gapminder %>% 
  filter(country %in% c("Argentina", "Belgium", "Mexico"),
         year %in% c(1987, 1992))
```


---

## `summarize()`

- Any numerical summary that you want to apply to a column of a data frame is specified within `summarize()`.

```{r eval=FALSE}
max_exp_1997 <- gapminder %>% 
  filter(year == 1997) %>% 
  summarize(max_exp = max(lifeExp))
max_exp_1997
```
--

```{r echo=FALSE}
max_exp_1997 <- gapminder %>% 
  filter(year == 1997) %>% 
  summarize(max_exp = max(lifeExp))
max_exp_1997
```



---

### Combining `summarize()` with `group_by()`

When you'd like to determine a numerical summary for all
levels of a different categorical variable

```{r eval=FALSE}
max_exp_1997_by_cont <- gapminder %>% 
  filter(year == 1997) %>% 
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp))
max_exp_1997_by_cont
```

--
```{r echo=FALSE}
max_exp_1997_by_cont <- gapminder %>% 
  filter(year == 1997) %>% 
  group_by(continent) %>%
  summarize(max_exp = max(lifeExp))
max_exp_1997_by_cont
```

---

### Without the `%>%`

It's hard to appreciate the `%>%` without seeing what the code would look like without it:

```{r}
max_exp_1997_by_cont <- 
  summarize(
    group_by(
      filter(
        gapminder, 
          year == 1997), 
      continent),
    max_exp = max(lifeExp))
max_exp_1997_by_cont
```



---

## `ggplot2` revisited

For aggregated data, use `geom_col`

```{r fig.height=4.5}
ggplot(data = max_exp_1997_by_cont, 
       mapping = aes(x = continent, y = max_exp)) +
  geom_col()
```
---


## The 5MV

- `filter()`
- `summarize()`
- `group_by()`
--

- `mutate()`

--
- `arrange()`

---

## `mutate()`

- Allows you to 
    1. <font color="blue">create a new variable with a specific value</font> OR
    2. create a new variable based on other variables OR
    3. change the contents of an existing variable

--

```{r}
gap_plus <- gapminder %>% mutate(just_one = 1)
head(gap_plus)
```

---

## `mutate()`

- Allows you to 
    1. create a new variable with a specific value OR
    2. <font color="blue">create a new variable based on other variables</font> OR
    3. change the contents of an existing variable

--

```{r}
gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
head(gap_w_gdp)
```

---

## `mutate()`

- Allows you to 
    1. create a new variable with a specific value OR
    2. create a new variable based on other variables OR
    3. <font color="blue">change the contents of an existing variable</font>

--

```{r}
gap_weird <- gapminder %>% mutate(pop = pop + 1000)
head(gap_weird)
```

---

## `arrange()`

- Reorders the rows in a data frame based on the values of one or more variables
--

```{r}
gapminder %>%
  arrange(year, country)
```

---

## `arrange()`

- Can also put into descending order
--

```{r desc}
gapminder %>%
  filter(year > 2000) %>%
  arrange(desc(lifeExp))
```

---

## Don't mix up `arrange` and `group_by`

- `group_by` is used (mostly) with `summarize` to calculate summaries over groups

- `arrange` is used for sorting

---

## Don't mix up `arrange` and `group_by`

This doesn't really do anything useful

```{r rows.print=10}
gapminder %>% group_by(year)
```

---

## Don't mix up `arrange` and `group_by`

But this does

```{r rows.print=10}
gapminder %>% arrange(year)
```

---

## Changing of observation unit

True or False
> Each of `filter`, `mutate`, and `arrange` change the observational unit.

--

True or False
> `group_by() %>% summarize()` changes the observational unit.

<!-- 
Draw diagram for average monthly temp aggregated like on rstudio::conf slides
-->

---

## What is meant by "joining data frames" and <br> why is it useful?

--

```{r echo=FALSE, fig.align='center'}
knitr::include_graphics("https://ismayc.github.io/moderndiver-book/images/join-inner.png")
```

---

### Does cost of living in a state relate to whether police officers live in the cities they patrol?  What about state political ideology?


```{r eval=FALSE}
library(fivethirtyeight); library(readr); library(dplyr)
police2 <- police_locals %>% select(city, all)
ideology <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/ideology.csv")
police_join <- inner_join(x = police2, y = ideology, by = "city")
police_join
```

--

```{r echo=FALSE}
library(fivethirtyeight); library(readr); library(dplyr)
police2 <- police_locals %>% select(city, all)
ideology <- read_csv("https://ismayc.github.io/Effective-Data-Storytelling-using-the-tidyverse/datasets/ideology.csv")
police_join <- inner_join(x = police2, y = ideology, by = "city")
police_join
```


---

```{r}
url <- paste0("https://ismayc.github.io/",
              "Effective-Data-Storytelling-using-the-tidyverse/",
              "datasets/cost_of_living.csv")
cost_of_living <- read_csv(url)
police_join_cost <- inner_join(
    x = police_join, 
    y = cost_of_living, 
    by = "state"
  )
police_join_cost %>% select(-state) %>% arrange(city)
```

---

### Does cost of living in a state relate to whether police officers live in the cities they patrol?  What about state political ideology?

```{r fig.width=10}
ggplot(data = police_join_cost,
       mapping = aes(x = index, y = all)) +
  geom_point(mapping = aes(color = state_ideology)) +
  labs(x = "Cost of Living Index", y = "% Officers Living in City")
```

---

## Practice <small><small>(Lay out what the resulting table should look like on paper first.)</small></small>

Use the 5MV to answer problems from R data packages, e.g., [`nycflights13` $\rightarrow$ `weather`] 

1. What is the maximum arrival delay for each carrier departing JFK? [`nycflights13` $\rightarrow$ `flights`]

2. Determine the top five movies in terms of domestic return on investment for 2013 scaled data<br> [`fivethirtyeight` $\rightarrow$ `bechdel`]

3. [Include the name of the destination airport as a column in the `flights` data frame](http://r4ds.had.co.nz/diagrams/relational-nycflights.png) <br> [`nycflights13` $\rightarrow$ `flights`, `airports`]

---


class: inverse, center, middle

# DEMO of `dplyr` in RStudio

---

<!--
## Homework for tomorrow

- Review the Data Import cheatsheet and use the `readr`, `readxl`, and `haven` packages to read in data from CSVs, Excel XLS/XLSX files, STATA/SPSS files

- Use `ggplot2` and `dplyr` to make graphics and produce data wrangling on data sets of interest to you read in from the previous step
-->

class: inverse, center, middle


# Data Importing and Tidying

---

## `tidyverse` packages

**IMPORT**

- `haven` for SPSS, SAS, and Stata data files
- `jsonlite` for JSON files
- `readxl` for XLS and XLSX files
- `readr` for CSV and TSV files (and R specific RDS files)

<br>

**TIDYING**

- `tidyr` for converting "messy" into "tidy" data frames

---

name: import

## Basics of Importing

<small>We will begin by downloading and importing a variety of different "messy" data sets.  You can download all of them in a ZIP file at <http://bit.ly/reed-messy-data>.  The links below go to the original sources.  I've converted these original sources into different formats.</small>


- [Life Expectancy by year for countries since 1951 (CSV)](https://spreadsheets.google.com/pub?key=phAwcNAVuyj2tPLxKvvnNPA&output=xls)

- [World Health Organization TB data (Stata DTA)](http://www.who.int/tb/country/data/download/en/)

- [Annual Estimates of Population by Age, Sex, Race, and Hispanic Origin for Oregon Counties (XLSX)](https://www.census.gov/popest/data/counties/asrh/2011/CC-EST2011-alldata.html)
- [Educational attainment for the U.S., States, and counties, 1970-2014 (JSON)](http://www.ers.usda.gov/data-products/county-level-data-sets/download-data.aspx)
- [County level results for 2012 POTUS election from The Guardian (SPSS SAV)](https://fusiontables.google.com/DataSource?docid=1qcQLqrAIAe3RcEfdWSm_QcXMLmteVg4uSpSs1rM)


---

class: center, middle, inverse

## Demonstration in RStudio

---

## Practice

- Download the four other files and import them into R using the appropriate package
    - You may need to check out the help pages for the different packages
        - [`haven` for SPSS, SAS, and Stata data files](http://haven.tidyverse.org/)
        - [`jsonlite` for JSON files](https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html)
        - [`readxl` for XLS and XLSX files](https://github.com/hadley/readxl)
        - [`readr` for CSV/TSV files (& R specific RDS files)](http://readr.tidyverse.org/)
    - Give them the following names: `who_df`, `county_pop_df`, `edu_county_df`, and `potus12_df`
  
---

name: tidy
class: inverse, middle, center

# Tidy Data


---

<img src="http://garrettgman.github.io/images/tidy-1.png" style="width: 750px;"/>

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

The third point means we don't mix apples and oranges.

---

## What is Tidy Data?

1. Each observation forms a row. In other words, each row corresponds to a single instance of an <u>observational unit</u>
1. Each variable forms a column:
    + Some variables may be used to identify the <u>observational units</u>. 
    + For organizational purposes, it's generally better to put these in the left-hand columns
1. Each type of observational unit forms a table with the entries in the table corresponding to values.

---

## Differentiating between <u>neat</u> data and <u>tidy</u> data

- Colloquially, they mean the same thing
- But in our context, one is a subset of the other. 

<br>

<u>Neat</u> data is 
  - easy to look at, 
  - organized nicely, and 
  - in table form.

--

<u>Tidy</u> data is neat but also abides by a set of three rules.

---

class: center, middle

<a href="http://stream1.gifsoup.com/view8/20150404/5192859/lebowski-abides-o.gif"><img src="figure/lebowski-abides-o.gif" style="width: 450px;"/></a>


<img src="figure/tidy-1.png" alt="Drawing" style="width: 750px;"/>

---

## Is this tidy?

```{r echo=FALSE, message=FALSE, warning=FALSE}
library(fivethirtyeight)
set.seed(2)
bechdel %>% sample_n(12) %>%
  select(year, title, clean_test, budget_2013) %>%
  arrange(title) %>%
  kable(format = "html")
```


---

name: demscore

## How about this? Is this tidy?

```{r echo=FALSE, message=FALSE, warning=FALSE}
dem_score <- read_csv("data/dem_score.csv")
knitr::kable(dem_score %>% slice(1:12), format = "html")
```

---

name: whytidy

## Why is tidy data important?

- Think about trying to plot democracy score across years in the simplest way possible.
--

- It would be much easier if the data looked like what follows instead so we could put 
    - `year` on the `x`-axis and 
    - `dem_score` on the `y`-axis.

---

## Tidy is good

```{r echo=FALSE}
dem_score_tidy <- dem_score %>% 
  gather(-country, key = "year", value = "dem_score") %>% 
  mutate(year = as.numeric(year)) 
set.seed(2)
dem_score_tidy %>% sample_n(13) %>% arrange(country) %>% 
  knitr::kable(format = "html")
```

---

## Let's plot it

- Plot the line graph for three countries using `ggplot`

```{r}
dem_score4 <- dem_score_tidy %>%
  filter(country %in% c("Pakistan", "Portugal", "Uruguay"))
ggplot(data = dem_score4, mapping = aes(x = year, y = dem_score)) +
  geom_line(mapping = aes(color = country))
```

---

## Tidying

### The Life Expectancy by year data

```{r}
library(readr)
life_exp_df <- read_csv("data/le_mess.csv")
#View(life_exp_df)
```

```{r echo=FALSE}
knitr::include_graphics("figure/le_mess.png")
```

---

## Doing the "tidying"/reshaping

```{r message=FALSE}
library(tidyr)
library(dplyr)
(life_exp_tidy <- life_exp_df %>% 
    gather(key = "year", value = "life_exp", -country))
```

---

## Tidying

#### World Health Organization TB data (Stata DTA)


```{r}
library(haven)
who_df <- read_dta("data/who.dta")
#View(who_df)
```

```{r echo=FALSE}
knitr::include_graphics("figure/who_mess.png")
```

---

## WHO ugly...

- This data set contains strange variable names that will require us to look up their meaning in the corresponding [**data dictionary**](https://extranet.who.int/tme/generateCSV.asp?ds=dictionary).

--

- What do we know?
     - `country`, `iso2`, and `iso3` are redundant and identify the country
     - `year` is a variable and it looks like it corresponds to each country
- But what in the world is `new_sp_m014`? And the other columns?...

---

## First step

Like before, we can `gather` these non-`country` and non-`year` variables together:

```{r}
who_tidy <- who_df %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "value")
```

We can now see what this has done to the data frame:

```{r eval=FALSE}
View(who_tidy)
```

---

## Data dictionary saves us some...

1. The first three letters of entries in the `key` column correspond to `new` or `old` cases of TB.
2. The next two letters (after the `_`) correspond to TB type:
    - `rel` for relapse, `ep` for extrapulmonary TB
    - `sn` for smear negative, `sp` for smear positive
3. The next letter after the second `_` corresponds to the sex of the TB patient.
4. The remaining numbers correspond to age group:
    - `014` for 0 to 14 years
    - ...
    - `65` for 65 or older

---

- Looking over the entries of `key` in `who_tidy`, do you see anything else that doesn't match the pattern?

- It may be easier to remember that the entries of `key` correspond to column names in `who_df`:

```{r}
names(who_df)
```

---

## A fix using `stringr`

You can replace all of the entries in `key` that contained `newrel` with `new_rel` via

```{r}
library(stringr)
library(dplyr)
who_tidy <- who_tidy %>% 
  mutate(key = str_replace(
      string = key, 
      pattern = "newrel", 
      replacement = "new_rel")
    )
```

---

## Pulling apart variables

The entry `new_sp_m014` is actually four different variables.  Use the `separate` function to pull them apart:

```{r}
who_tidy <- who_tidy %>% 
  separate(col = key, into = c("new", "type", "sexage"), sep = "_")
who_tidy
```

---

## One more pull apart

- We need to pull `sexage` into two different variables.  
- If you use `?separate`, you'll see that the following is an option:

```{r}
(who_tidy <- who_tidy %>% 
  separate(col = sexage, into = c("sex", "age"), sep = 1))
```

---

## Final cleaning

- `iso2` and `iso3` are redundant if we have `country`
- `new` is constant since this only has new cases of TB
- Let's just ignore missing values here
- We also know that `value` is actually `cases` so we can `rename` that column.

```{r, tidy=FALSE}
who_tidy <- who_tidy %>% select(-iso2, -iso3, -new) %>% 
  na.omit() %>% rename("cases" = value)
head(who_tidy, 5)
```

---

## The power of the `%>%` (pipe)

Everything we did above can be chained into one sequence:


```{r eval=FALSE}
who_tidy <- who_df %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "value") %>%  
  mutate(key = str_replace(key, pattern = "newrel", 
                           replacement = "new_rel")) %>% 
  separate(col = key, into = c("new", "type", "sexage"), sep = "_") %>% 
  separate(col = sexage, into = c("sex", "age"), sep = 1) %>% 
  select(-iso2, -iso3, -new) %>% 
  na.omit() %>% 
  rename("cases" = value)
```

---

## BONUS

### A Gapminder tidy dataset read in from a Google Sheet!

- I've updated the `gapminder` data set available in the `gapminder` R data package by Jenny Bryan [here](https://github.com/jennybc/gapminder/).  Jenny provides [instructions](https://github.com/jennybc/gapminder/tree/master/data-raw) for recreating the data.

---

## BONUS

- You can download the updated data from Google Sheets by running the following in R:

```{r eval=FALSE}
# Link is https://docs.google.com/spreadsheets/d/18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA/edit?usp=sharing
library(googlesheets)
updated_gap <- gs_key("18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA") %>%
  gs_read()
```


- A script going through the steps to take the "messy" original Gapminder.org files and turn them into this tidy dataset is available [here](https://raw.githubusercontent.com/ismayc/tidyverse_workshops/master/importing_tidying/data/cleaning.R).

---

## Practice (after the bootcamp)

Try to tidy the other data sets you downloaded and imported or other data sets you have!

---

name: model
class: inverse, center, middle

# Data Modeling


---


## Modeling

1. Experience with `ggplot2` package and knowledge of the Grammar of Graphics primes students for modeling
1. Use of the `broom` package to unpack regression output into a tidy format

---

## 1. `ggplot2` Primes Regression

Example:

* All Alaskan Airlines and Frontier flights leaving NYC in 2013
* We want to study the relationship between temperature and departure delay
* For summer (June, July, August) and non-summer months separately

Involves four variables: 

- `carrier`, `temp`, `dep_delay`, `summer`

---

## 1. `ggplot2` Primes Regression

```{r}
flights_subset <- flights %>% 
  filter(carrier == "AS" | carrier == "F9") %>% 
  left_join(weather, by=c("year", "month", "day", "hour", "origin")) %>% 
  filter(dep_delay < 250) %>% 
  mutate(summer = ifelse(month == 6 | month == 7 | month == 8, "Summer Flights", "Non-Summer Flights"))
```

---

```{r, fig.height=6, fig.width=10.5}
ggplot(data = flights_subset, 
    mapping = aes(x = temp, y = dep_delay, color = carrier)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ summer)
```

---

## 1. `ggplot2` Primes Regression

Why? Dig deeper into data. Look at `origin` and `dest` variables as well:

<br> 

```{r}
flights_subset %>% 
  group_by(carrier, origin, dest) %>% 
  summarise(`Number of Flights` = n())
```

---


## 2. `broom` Package

* The `broom` package takes the messy output of built-in modeling functions in R, such as
`lm`, `nls`, or `t.test`, and turns them into tidy data frames.
* Fits in with `tidyverse` ecosystem
* This works for [many R data types](https://github.com/tidyverse/broom#available-tidiers)!

---

## 2. `broom` Package

In our case, `broom` functions take `lm` objects as inputs and return the following in tidy format!

* `tidy()`: regression output table
* `augment()`: point-by-point values (fitted values, residuals, predicted values)
* `glance()`: scalar summaries like <large> $R^2$ </large>, 

---

## 2. `broom` Package

Example

```{r}
library(tidyverse)
library(nycflights13)
library(knitr)
library(broom)
set.seed(2017)

# Load Alaska data, deleting rows that have missing departure delay
# or arrival delay data
alaska_flights <- flights %>% 
  filter(carrier == "AS") %>% 
  filter(!is.na(dep_delay) & !is.na(arr_delay)) %>% 
  sample_n(50)
```

---

```{r}
# Exploratory Data Analysis-----------------------------------------
# Plot of sample of points:
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
   geom_point()
```

---

```{r}
# Correlation coefficient:
alaska_flights %>% summarize(correl = cor(dep_delay, arr_delay))

# Add regression line
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")
```

---

```{r}
# Fit Regression and Study Output with broom Package-------------------
# Fit regression
options(scipen = 6) # Set scientific notation
delay_fit <- lm(formula = arr_delay ~ dep_delay, data = alaska_flights)

# 1. broom::tidy() regression table with confidence intervals 
# and no p-value stars
regression_table <- delay_fit %>% 
  tidy(conf.int = TRUE)
regression_table
```

---

```{r}
# 2. broom::augment() for point-by-point values
regression_points <- delay_fit %>% 
  augment() %>% 
  select(arr_delay, dep_delay, .fitted, .resid) 
regression_points
```

---

```{r}
# and for prediction
new_flights <- data_frame(dep_delay = c(25, 30, 15))
delay_fit %>% 
  augment(newdata = new_flights)
```

---

```{r}
# 3. broom::glance() scalar summaries of regression
regression_summaries <- delay_fit %>% 
  glance() 
regression_summaries
```

---

```{r}
# Residual Analysis------------------------------------------------------
ggplot(data = regression_points, mapping = aes(x = .resid)) +
  geom_histogram(binwidth=10, color = "white") +
  geom_vline(xintercept = 0, color = "blue")
```

---

```{r}
ggplot(data = regression_points, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 0, color = "blue")
```

---

```{r}
ggplot(data = regression_points, mapping = aes(sample = .resid)) +
  stat_qq()
```

---

class: center, middle, inverse
name: rmarkdown

# Data Communicating


---

## What is Markdown?

 - A "plaintext formatting syntax"
 - Type in plain text, render to more complex formats
 - One step beyond writing a `txt` file
 - Render to HTML, PDF, DOCX, etc. using Pandoc

---

## What does it look like?

.left-column[
```
  # Header 1
  
  ## Header 2
  
  Normal paragraphs of text go here.
  
  **I'm bold**
  
  [links!](http://rstudio.com)
  
   * Unordered
   * Lists   
   
  And  Tables
  ---- -------
  Like This
  
```
]

.right-column[
<img src="figure/markdown.png" alt="markdown" style="width: 270px;"/>
]

---

## What is R Markdown?
  
- "Literate programming"
- Embed R code in a Markdown document
- Renders textual output along with graphics

***

.left-column[
```

```{r chunk1}
library(ggplot2)
library(nycflights13)
pdx_flights <- flights %>% 
  filter(dest == "PDX", month == 5)
nrow(pdx_flights)
```

```{r chunk2}
ggplot(data = pdx_flights,
  mapping = aes(x = arr_delay, 
                y = dep_delay)) +
  geom_point()
```

```
]

.right-column[
```{r, fig.width=4.5, fig.height=4, echo=FALSE}
library(ggplot2)
library(nycflights13)
pdx_flights <- flights %>% 
  filter(dest == "PDX", month == 5)
nrow(pdx_flights)
ggplot(data = pdx_flights,
  mapping = aes(x = arr_delay, 
                y = dep_delay)) +
  geom_point()
```
]

---

## Practice

Turn a statistical analysis you have conducted into an R Markdown document.

- You can also publish online easily to <http://rpubs.com>

---

class: middle

# Thanks for attending!

- Slides created via the R package [xaringan](https://github.com/yihui/xaringan) by Yihui Xie
- Source code for these slides and the webpage at <https://github.com/ismayc/poRtland-bootcamp17>